Batch memcpy the last offsets for output buffers of str and list cols in PQ reader #16905
Performance Improvements
Summary
The time to write final offsets for
Benchmark Setup
Benchmark name:
Profiles (Before = top, After = bottom)
Before: 1024 x cudaMemcpyAsync = 2.393ms
cool stuff!
Bunch of small suggestions, mostly to polish the new functions
…://github.com/mhaseeb123/cudf into fea-batch-memcpy-list-str-output-buff-offsets
/ok to test
This is nice!
Co-authored-by: Vyas Ramasubramani <[email protected]>
/ok to test
/merge
Description
This PR adds the capability to batch memcpy the last offsets for the output buffers of string and list columns in the Parquet (PQ) reader. This reduces the overhead of issuing many individual cudaMemcpyAsync calls when reading wide tables with many string and/or list columns. The opportunity for this optimization was found by @vuule, who also contributed the corresponding ORC changes. See this comment for performance improvement data and discussion.
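To illustrate the idea, here is a minimal host-only C++ sketch, assuming a simplified model in which each output buffer of a string or list column is a vector of offsets whose final slot is still unset. The final offset of such a buffer is its total size: the character count for strings, or the child element count for lists. Instead of one copy per buffer (analogous to one cudaMemcpyAsync each), the sketch collects parallel arrays of source pointers, destination pointers and sizes and hands them to a single batched call. All names here (`batched_memcpy`, `OutputBuffer`, `write_final_offsets`) are hypothetical and are not cuDF's API. On the GPU, a batched primitive such as CUB's `cub::DeviceMemcpy::Batched` can perform many copies in one launch. This sketch only models the call structure on the host and does not reproduce GPU execution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-in for a batched copy primitive: ONE call
// receives parallel arrays of source pointers, destination pointers and sizes.
// On a GPU the whole batch would be a single launch; here it is just a loop.
void batched_memcpy(const std::vector<const void*>& srcs,
                    const std::vector<void*>& dsts,
                    const std::vector<std::size_t>& sizes) {
  for (std::size_t i = 0; i < srcs.size(); ++i) {
    std::memcpy(dsts[i], srcs[i], sizes[i]);
  }
}

// Hypothetical simplified output buffer of a string or list column:
// `offsets` has (num_rows + 1) entries and the last entry is not yet written.
struct OutputBuffer {
  std::vector<int32_t> offsets;  // precondition: non-empty
};

// Writes the last offset of every buffer with a single batched call instead
// of one copy per buffer. final_sizes[i] becomes buffers[i]'s last offset.
void write_final_offsets(std::vector<OutputBuffer>& buffers,
                         const std::vector<int32_t>& final_sizes) {
  assert(buffers.size() == final_sizes.size());
  std::vector<const void*> srcs;
  std::vector<void*> dsts;
  std::vector<std::size_t> sizes;
  srcs.reserve(buffers.size());
  dsts.reserve(buffers.size());
  sizes.reserve(buffers.size());
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    srcs.push_back(&final_sizes[i]);               // value to write
    dsts.push_back(&buffers[i].offsets.back());    // last slot of the buffer
    sizes.push_back(sizeof(int32_t));              // one offset per copy
  }
  batched_memcpy(srcs, dsts, sizes);  // one call covers all buffers
}
```

For example, with three buffers whose final slots are unset, `write_final_offsets(buffers, {9, 7, 1})` fills in 9, 7 and 1 through a single `batched_memcpy` call, where an unbatched version would issue three separate copies. The saving grows with the number of string and list output buffers, which is why wide tables benefit most.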